5 research outputs found

    Rebasing Microarchitectural Research with Industry Traces

    Get PDF
    Microarchitecture research relies on performance models with varying degrees of accuracy and speed. In the past few years, one such model, ChampSim, has gained significant traction by coupling ease of use with a reasonable level of detail and simulation speed. At the same time, datacenter-class workloads, which are not trivial to set up and benchmark, have become easier to study thanks to the release of hundreds of industry traces following the first Championship Value Prediction (CVP-1) in 2018. A tool was quickly created to port the CVP-1 traces to the ChampSim format, and the converted traces have consequently been used in many recent works. In this paper, we revisit this conversion tool and find that several key aspects of the CVP-1 traces are not preserved by the conversion. We therefore propose an improved converter that addresses most conversion issues and also patches known limitations of the CVP-1 traces themselves. We evaluate the impact of our changes on two commits of ChampSim, one of which was used for the first Instruction Prefetching Championship (IPC-1) in 2020. We find that the performance variation stemming from the higher-accuracy conversion is significant.
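
    To make the conversion issue concrete, the C++ sketch below shows a single-record conversion step between two simplified, hypothetical stand-ins for CVP-1 and ChampSim trace records; the real on-disk formats differ, and all field names here are illustrative. The point it makes is that information carried by the source trace (e.g., branch outcome and target) must be mapped explicitly, or it is silently lost.

    // Illustrative sketch of a CVP-1-to-ChampSim record conversion step.
    // Both struct layouts are simplified stand-ins, not the exact on-disk
    // formats; all field names are hypothetical.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Simplified stand-in for a decoded CVP-1 trace record.
    struct Cvp1Record {
        uint64_t pc;
        bool     is_branch;
        bool     taken;          // branch outcome carried by the source trace
        uint64_t branch_target;  // easily dropped by a naive converter
        std::vector<uint8_t>  src_regs;
        std::vector<uint8_t>  dst_regs;
        std::vector<uint64_t> mem_addrs;  // effective addresses of memory ops
    };

    // Simplified stand-in for a ChampSim-style input instruction.
    struct ChampSimRecord {
        uint64_t ip;
        uint8_t  is_branch;
        uint8_t  branch_taken;
        uint8_t  dst_regs[2]{};
        uint8_t  src_regs[4]{};
        uint64_t dst_mem[2]{};
        uint64_t src_mem[4]{};
    };

    // Convert one record, explicitly preserving the branch outcome that a
    // lossy converter might approximate or discard.
    ChampSimRecord convert(const Cvp1Record& in) {
        ChampSimRecord out{};
        out.ip           = in.pc;
        out.is_branch    = in.is_branch ? 1 : 0;
        out.branch_taken = (in.is_branch && in.taken) ? 1 : 0;
        for (std::size_t i = 0; i < in.dst_regs.size() && i < 2; ++i)
            out.dst_regs[i] = in.dst_regs[i];
        for (std::size_t i = 0; i < in.src_regs.size() && i < 4; ++i)
            out.src_regs[i] = in.src_regs[i];
        // Memory operands: here we assume loads only (src_mem), for brevity.
        for (std::size_t i = 0; i < in.mem_addrs.size() && i < 4; ++i)
            out.src_mem[i] = in.mem_addrs[i];
        return out;
    }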

    CELLO: Compiler-Assisted Efficient Load-Load Ordering in Data-Race-Free Regions

    Get PDF
    Efficient Total Store Order (TSO) implementations allow loads to execute speculatively out of order. To detect ordering violations, the load queue (LQ) holds all in-flight loads and is searched on every invalidation and cache eviction. Moreover, in a simultaneous multithreading (SMT) processor, stores also search the LQ when writing to the cache. LQ searches entail considerable energy consumption. Furthermore, the processor stalls when the LQ is full or its ports are busy. Hence, the LQ is a critical structure in terms of both energy and performance. In this work, we observe that the use of the LQ can be dramatically optimized under the guarantees of the data-race-free (DRF) property imposed by modern programming languages. To leverage this observation, we propose CELLO, a software-hardware co-design in which the compiler detects memory operations in DRF regions and the hardware optimizes their execution by safely skipping LQ searches without violating the TSO consistency model. Furthermore, CELLO removes DRF loads from the LQ earlier, as they do not need to be searched to detect consistency violations. With minimal hardware overhead, we show that an 8-core 2-way SMT processor with CELLO avoids almost all conservative LQ searches and significantly reduces LQ occupancy. CELLO i) reduces LQ energy expenditure by 33% on average (up to 53%) while performing 2.8% better on average (up to 18.6%) than the baseline system, and ii) shrinks the LQ from 192 to only 80 entries, reducing LQ energy expenditure by as much as 69% while performing on par with a mainstream LQ implementation.
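
    A minimal sketch of the central mechanism, assuming a hypothetical per-load DRF bit derived from a compiler hint (structures and names are illustrative, not the paper's design): only non-DRF loads are searched on invalidations and evictions, and completed DRF loads may leave the LQ early, reducing occupancy.

    // Toy LQ model illustrating the CELLO idea; not the paper's RTL.
    #include <cstdint>
    #include <deque>

    struct LoadEntry {
        uint64_t address;
        bool     drf;        // hypothetical bit set from a compiler-provided hint
        bool     completed;
    };

    struct LoadQueue {
        std::deque<LoadEntry> entries;  // program order: oldest at the front

        // On an invalidation or eviction of `addr`, only non-DRF loads must
        // be searched for a possible TSO ordering violation.
        bool search_on_invalidation(uint64_t addr) const {
            for (const auto& e : entries)
                if (!e.drf && e.completed && e.address == addr)
                    return true;  // a real core would squash and replay here
            return false;
        }

        // DRF loads never need to be searched, so they can be removed as
        // soon as they complete instead of lingering until retirement.
        void early_release() {
            while (!entries.empty() && entries.front().drf &&
                   entries.front().completed)
                entries.pop_front();
        }
    };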

    The Forward Slice Core: A High-Performance, Yet Low-Complexity Microarchitecture

    No full text
    Superscalar out-of-order cores deliver high performance at the cost of increased complexity and power budget. In-order cores, in contrast, are less complex and have a smaller power budget, but offer low performance. A processor architecture should ideally provide high performance in a power- and cost-efficient manner. Recently proposed slice-out-of-order (sOoO) cores identify backward slices of memory operations, which they execute out of order with respect to the rest of the dynamic instruction stream for increased instruction-level and memory-hierarchy parallelism. Unfortunately, constructing backward slices is imprecise and hardware-inefficient, leaving performance on the table. In this article, we propose the Forward Slice Core (FSC), a novel core microarchitecture that builds on a stall-on-use in-order core and extracts more instruction-level and memory-hierarchy parallelism than slice-out-of-order cores. FSC does so by identifying and steering forward slices (rather than backward slices) to dedicated in-order FIFO queues. Moreover, FSC sets aside load-consumers that depend on L1 D-cache misses to enable younger independent load-consumers to execute sooner. Finally, FSC eliminates the need for dynamic memory disambiguation by replicating store-address instructions across queues. Considering 3-wide pipeline configurations, we find that FSC improves performance by 27.1%, 21.1%, and 14.6% on average over Freeway, the state-of-the-art sOoO core, across SPEC CPU2017, GAP, and DaCapo, respectively, while incurring lower hardware complexity. Compared to an OoO core, FSC reduces power consumption by 61.3% and chip area by 47%, providing a microarchitecture with high performance at low complexity.
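
    The sketch below gives one plausible reading of forward-slice steering, assuming a simplified ISA with one destination and up to two source registers per instruction; the queue names and the propagation rule are illustrative rather than the paper's exact design. Slice membership starts at loads and propagates forward to consumers through the register each instruction writes.

    // Illustrative forward-slice steering logic; not FSC's actual design.
    #include <array>
    #include <cstdint>
    #include <optional>

    enum class Queue { Main, Slice };  // FSC uses several in-order FIFO queues

    struct Insn {
        bool is_load;
        std::optional<uint8_t> dst;                    // destination register
        std::array<std::optional<uint8_t>, 2> srcs;    // source registers
    };

    struct Steering {
        // Which queue last produced each architectural register.
        // Zero-initialization maps every register to Queue::Main.
        std::array<Queue, 64> producer{};

        Queue steer(const Insn& in) {
            // Loads start a forward slice; consumers of slice-produced
            // values follow their producers into the slice queue.
            Queue q = in.is_load ? Queue::Slice : Queue::Main;
            for (const auto& s : in.srcs)
                if (s && producer[*s] == Queue::Slice)
                    q = Queue::Slice;
            if (in.dst)
                producer[*in.dst] = q;  // propagate slice membership forward
            return q;
        }
    };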

    Symbiotic job scheduling on the IBM POWER8

    No full text
    Simultaneous multithreading (SMT) processors share most microarchitectural core components among the co-running applications. The competition for shared resources causes performance interference between applications. Therefore, the performance benefits of SMT processors heavily depend on the complementarity of the co-running applications. Symbiotic job scheduling, i.e., scheduling applications that co-run well together on a core, can have a considerable impact on the performance of a processor with SMT cores. Prior work uses sampling or novel hardware support to perform symbiotic job scheduling, which either incurs non-negligible overhead or cannot be used on existing hardware. This paper proposes a symbiotic job scheduler for the IBM POWER8 processor. We leverage the existing cycle accounting mechanism to predict symbiosis between applications, and use that information at run time to decide which applications should run on the same core and which on separate cores. We implement the scheduler in the Linux operating system and evaluate it on an IBM POWER8 server running multiprogrammed workloads. The symbiotic job scheduler significantly improves performance compared to both an agnostic random scheduler and the default Linux scheduler. With respect to Linux, it achieves an average speedup of 8.8% for workloads comprising 12 applications, and of 4.7% across all evaluated workloads.
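
    As a rough illustration of the pairing step only (not the paper's cycle-accounting predictor), the sketch below reduces each application to a single hypothetical interference score and co-schedules the most intensive application with the least intensive remaining one, a classic symbiosis heuristic; the scalar score and the greedy rule are deliberate simplifications.

    // Toy symbiosis-aware pairing; the interference model is illustrative.
    #include <algorithm>
    #include <cstddef>
    #include <utility>
    #include <vector>

    struct App {
        int    id;
        double intensity;  // e.g., derived from memory-stall CPI-stack components
    };

    // Greedy heuristic: sort by intensity, then pair the extremes so that no
    // core receives two highly contending applications.
    std::vector<std::pair<int, int>> pair_apps(std::vector<App> apps) {
        std::sort(apps.begin(), apps.end(),
                  [](const App& a, const App& b) { return a.intensity < b.intensity; });
        std::vector<std::pair<int, int>> pairs;
        if (apps.size() < 2)
            return pairs;
        for (std::size_t lo = 0, hi = apps.size() - 1; lo < hi; ++lo, --hi)
            pairs.emplace_back(apps[lo].id, apps[hi].id);  // one pair per SMT core
        return pairs;
    }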

    VMT: Virtualized Multi-Threading for Accelerating Graph Workloads on Commodity Processors

    No full text
    Modern-day graph workloads operate on huge graphs through pointer chasing, which leads to high last-level cache (LLC) miss rates and limited memory-level parallelism (MLP). Simultaneous Multi-Threading (SMT) effectively hides the memory access latencies of multi-threaded graph workloads, provided that sufficient threads are supported in hardware. Unfortunately, providing a sufficiently large number of physical threads incurs an unjustifiably high hardware cost for commodity SMT processors, which typically implement only two physical hardware threads. Ideally, we would like to achieve aggressive-SMT performance when running graph workloads on modest commodity processors. In this paper, we propose Virtualized Multi-Threading (VMT), a low-overhead multi-threading paradigm for accelerating graph workloads on commodity processors. Unlike prior multi-threading paradigms, VMT virtualizes both the physical hardware threads and the architecture state: VMT maps a large number of logical software threads to a small number of physical hardware threads, while maintaining the architecture state of the logical threads in the processor's cache hierarchy. Implemented on top of a quad-core 2-way SMT processor, VMT achieves an average speedup of 1.74x for a set of representative graph workloads, while incurring minimal hardware cost (195 bytes per core to support up to 32 logical threads). VMT's low hardware cost paves the way for implementation in commodity processors.
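
    The sketch below expresses the gist of the virtualization step in software terms, assuming a compact, cache-resident save area per logical thread and a round-robin switch on long-latency misses; VMT realizes this with minimal hardware support, so all names, sizes, and the switch policy here are illustrative only.

    // Toy model of multiplexing many logical threads onto one hardware
    // context, with inactive state kept in a small save area that stays
    // hot in the cache hierarchy. Illustrative, not VMT's mechanism.
    #include <array>
    #include <cstdint>

    constexpr int kLogicalThreads = 32;

    // Compact per-thread architectural state.
    struct LogicalContext {
        uint64_t pc;
        std::array<uint64_t, 16> regs;  // a small register subset, for brevity
        bool runnable;
    };

    struct VmtScheduler {
        std::array<LogicalContext, kLogicalThreads> ctx{};
        int current = 0;

        // Called when the running logical thread hits a long-latency
        // (LLC-missing) load: save its state, then round-robin to the next
        // runnable logical thread so the miss latency is overlapped.
        int switch_on_miss(uint64_t resume_pc,
                           const std::array<uint64_t, 16>& live_regs) {
            ctx[current].pc   = resume_pc;
            ctx[current].regs = live_regs;
            for (int step = 1; step <= kLogicalThreads; ++step) {
                int next = (current + step) % kLogicalThreads;
                if (ctx[next].runnable) { current = next; break; }
                // falls back to the same thread if no other is runnable
            }
            return current;  // caller restores ctx[current] and resumes
        }
    };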